Heterogeneity-aware Fault Tolerance using a Self-Organizing Runtime System
نویسندگان
چکیده
Due to the diversity and implicit redundancy in terms of processing units and compute kernels, off-the-shelf heterogeneous systems offer the opportunity to detect and tolerate faults during task execution in hardware as well as in software. To automatically leverage this diversity, we introduce an extension of an online-learning runtime system that combines the benefits of the existing performance-oriented task mapping with task duplication, a diversity-oriented mapping strategy and heterogeneity-aware majority voter. This extension uses a new metric to dynamically rate the remaining benefit of unreliable processing units and a memory management mechanism for automatic data transfers and checkpointing in the host and device memories.
منابع مشابه
CAFT: Cost-aware and Fault-tolerant routing algorithm in 2D mesh Network-on-Chip
By increasing, the complexity of chips and the need to integrating more components into a chip has made network –on- chip known as an important infrastructure for network communications on the system, and is a good alternative to traditional ways and using the bus. By increasing the density of chips, the possibility of failure in the chip network increases and providing correction and fault tol...
متن کاملCheckpoint Based Recovery Aware Component System in Grid Computing
Grids are distributed systems that dynamically coordinate a large number of heterogeneous resources to execute large scale projects involving collaborating teams of scientists, high performance computers, massive data stores, high bandwidth networking, and/or scientific instruments like telescopes, and synchrotrons. Failure in grids is arguably inevitable due to the massive scale and the hetero...
متن کاملIncreasing the Robustness of Self-Organizing Systems By Means of Trust Practices
Self-organizing systems are becoming increasingly complex in their organisational structures, especially when unknown heterogeneous entities might arbitrarily enter and leave the network at any time. Therefore, new ways have to be found to develop and manage them. One way to overcome this issue is trust. Using appropriate trust mechanisms, entities in the system can have an indication about whi...
متن کاملTunable Fault Tolerance for Runtime Reconfigurable Architectures
Fault tolerance is becoming an increasingly important issue, especially in mission-critical applications where data integrity is a paramount concern. Performance, however, remains a large driving force in the market place. Runtime reconfigurable hardware architectures have the power to balance fault tolerance with performance, allowing the amount of fault tolerance to be tuned at run-time. This...
متن کاملReliability and Performance Evaluation of Fault-aware Routing Methods for Network-on-Chip Architectures (RESEARCH NOTE)
Nowadays, faults and failures are increasing especially in complex systems such as Network-on-Chip (NoC) based Systems-on-a-Chip due to the increasing susceptibility and decreasing feature sizes. On the other hand, fault-tolerant routing algorithms have an evident effect on tolerating permanent faults and improving the reliability of a Network-on-Chip based system. This paper presents reliabili...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1405.2912 شماره
صفحات -
تاریخ انتشار 2014